5023B

Data Science for Biologists

Dr Philip Leftwich

About me

Associate Professor in Data Science and Genetics at the University of East Anglia.


Academic background in Behavioural Ecology, Genetics, and Insect Pest Control.


Teach Genetics, Programming, and Statistics

UEA logo

Introductions

Outline of the course

  • Advanced linear modelling
  • Power analysis
  • Data reproducibility
  • R programming
  • Machine Learning

Expectations

  • One workshop per week

  • One lecture per week

  • One assignment per week

  • One ‘capstone’ project

What to expect during this course


I hope you end up with more questions than answers!


Schitts Creek questions gif

Source: giphy.com

Reproducible Research

What is Reproducibility?

Introduction to Reproducible Research

Turing Way Community cc-by licence

  • For Research to be reproducible both data and methods should be available.

  • Applying the described methods to the data leads to the same results

Methods

  • In theory, method availability ≠ code

  • But with complex data and analyses - are methods of data collection enough?

Self-correcting Science?

  • Science advances incrementally by identifying and rectifying errors over time

  • Peer review: Critical evaluation of papers by experts maintain quality

  • Independent studies either support or fail to replicate findings

Self-correcting Science?

  • Publication bias: preference for positive results

  • Pressure to publish

  • Poor study designs and statistical issues

  • Lack of transparency

Reproducibility trial:

246 biologists get different results from same data sets

: Forest plots of meta-analytic estimated standardized effect sizes (Zr, blue triangles) and their 95% confidence intervals for each effect size included in the meta-analysis model. (A) Blue tit analyses: Points where Zr are less than 0 indicate analyses that found a negative relationship between sibling number and nestling growth. (B) Eucalyptus analyses: Points where Zr are less than 0indicate a negative relationship between grass cover and Eucalyptus seedling success

Reproducibility Crisis

  • The reproducibility crisis emerged when numerous studies, especially in fields like psychology, medicine, and biology, failed to be replicated by other researchers.

  • High-profile replication attempts revealed that many published results could not be consistently reproduced, raising doubts about their validity.

Crisis as an Opportunity

  • Recognition that no study should be considered ‘definitive’

  • Empower lasting systemic change through increased transparency in research methods, data sharing and reporting

  • Structural change in academic culture

Open Science

Open Science

Open science is a global movement that aims to make scientific research and its outcomes freely accessible to everyone. By fostering practices like data sharing and preregistration, open science not only accelerates scientific progress but also strengthens trust in research findings.

UKRN

  • UK Reproducibility Network - funded by UK Research Council

  • 46 member institutions (UEA is one)

  • Establish open research practices across UK Research

  • https://www.ukrn.org/

UKRN

Project management

Tidy projects

/home/phil/Documents/paper
├── abstract.R
├── correlation.png
├── data.csv
├── data2.csv
├── fig1.png
├── figure 2 (copy).png
├── figure.png
├── figure1.png
├── figure10.png
├── partial data.csv
├── script.R
└── script_final.R

Organised projects

  • README

  • Documented

  • Easy to code with

  • All files are inside the root folder

R projects

R projects

Slugs

  • A string of characters defining a file

What do you think are the contents of these files:

  • data/raw/madrid_minimum-temperature.csv

  • scripts/02_compute_mean-temperature.R

  • analysis/01_madrid_minimum-temperature_descriptive-statistics.qmd

Name files

Come up with good names for these:

  • a dataset of cats with columns for weight, length, tail length, fur colour(s), fur type and name.

  • a script that downloads data from Spotify.

  • a scripts that cleans up data.

  • a scripts that fits a linear discriminant model and saves it to a file.

R projects and clean slates

R projects

  • Use projects

  • Check your code runs on blank slates

Quarto

  • Automates the creation of a paper or report

  • Saves time

  • Reduces errors

copy-paste

(https://www.nature.com/articles/d41586-022-00563-z)

Git

copy-paste

Git repository

copy-paste

Git collab

copy-paste

Forking

copy-paste

Renv

copy-paste

copy-paste

Benefits

Week Two: Descriptive Statistics

Building Statistical Models

  • What is a Statistical Model?

  • A model is a simplified representation of real-world processes.

  • It helps us describe, explain, and predict outcomes.

  • A good fit makes accurate predictions; a poor fit can lead to misleading conclusions.

  • To make reliable inferences, the model must accurately represent the data.

“Later, we’ll see how models help us test hypotheses using p-values to assess if the data fits our expectations.”

Populations and Samples

Population:

  • The entire group you want to study (e.g., all humans, all mice in a lab).

  • Studying the entire population is often impractical due to time, cost, or logistics.

Sample:

  • A subset of the population, selected to make inferences about the whole.

  • Must be representative to ensure accurate conclusions.

Descriptive Statistics

  • Summarize data to highlight key features:

  • Central Tendency: Where is the “center” of the data? (mean, median, mode)

  • Spread: How variable are the data? (variance, standard deviation)

  • Helps us understand the data before making inferences.

Central tendency

The central tendency of a series of observations is a measure of the “middle value”

The three most commonly reported measures of central tendency are the sample mean, median, and mode.

Mean Median Mode
The average value The middle value The most frequent value
Sum of the total divided by n The middle value (if n is odd). The average of the two central values (if n is even) The most frequent value
Most common reported measure, affected by outliers Less influenced by outliers, improves as n increases Less common

Mean

One of the simplest statistical models in biology is the mean

Lecturer Friends
Mark 5
Tony 3
Becky 3
Ellen 2
Phil 1
Mean 2.6
Median 3
Mode 3

Calculating the mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \] \[\frac{5 + 3 + 3 + 2 + 1}{5} = 2.6\]

Sum all the values and divide by n of values

The mean

One of the simplest statistical models in biology is the mean

Lecturer Friends
Mark 5
Tony 3
Becky 3
Ellen 2
Phil 1
Mean 2.6
Median 3
Mode 3

We already know this is a hypothetical value as you can’t have 2.6 friends (I think?)

Now with any model we have to know how well it fits/ how accurate it is

Symmetrical

If data is symmetrically distributed, the mean and median will be close, especially as n increases.

This is also know as a normal distribution

Variance & Standard Deviation

Variance:

The average of the squared differences from the mean.

\[s{^2}_{sample} = \frac{\sum(x - \bar{x})^2}{N -1}\] Higher variance = more spread.

Standard Deviation (SD):

  • The square root of variance.

  • Easier to interpret because it’s in the same units as the data.

Calculating variance

Symmetrical sides

Lecturer Friends Residuals Sq Resid
Mark 5 2.4 5.76
Tony 3 0.4 0.16
Becky 3 0.4 0.16
Ellen 2 -0.6 0.36
Phil 1 -1.6 2.56
Mean 2.6
  • Sum of Squared Residuals = 9

  • N-1 = 4

  • Variance = 9/4 = 2.25

Standard Deviation

  • Square root of sample variance

  • A measure of dispersion of the sample

  • Smaller SD = more values closer to mean, larger SD = greater data spread from mean

  • variance:

\[ \sigma = \sqrt{\sum(x - \overline x)^2\over n - 1} \]

N-1?

Sampling

For a population the variance \(S_p^2\) is exactly the mean squared distance of the values from the population mean

\[ s{^2}_{pop} = \frac{\sum(x - \bar{x})^2}{N} \]

But this is a biased estimate for the population variance

  • A biased sample variance will underestimate population variance

  • n-1 (if you take a large enough sample size, will correct for this)

More here

Standard deviation - our example

Lecturer Friends Diff Squared diff
Mark 5 2.4 5.76
Tony 3 0.4 0.16
Becky 3 0.4 0.16
Ellen 2 -0.6 0.36
Phil 1 -1.6 2.56
Mean 2.6
variance 2.25
SD 1.5

Standard Deviation

Small \(s\) = data points are clustered near the mean

Large \(s\) = data points are widely dispersed around the mean

Why the Normal Distribution matters

Shape: Symmetrical, bell-shaped curve

  • Described by just two parameters: mean (μ) and standard deviation (σ).

  • Rule of Thumb: For normally distributed data:

    ~68% within 1 SD

    ~95% within 2 SDs

    ~99.7% within 3 SDs

Relevance: Many statistical tests assume data follows a normal distribution. This helps us calculate probabilities—like p-values—to test hypotheses.

Why the Normal Distribution matters

If we assume a normal distribution (or close enough), we can calculate the probability of observing any given value using just the mean and standard deviation.

\[ f(x) = \frac{1}{\sqrt{2\pi} \, \sigma} \exp\!\Biggl(-\frac{(x - \mu)^2}{2\,\sigma^2}\Biggr) \]

This has applications in hypothesis testing and building confidence intervals

Visualising a Distribution

Histograms plot frequency/density of observations within bins

Quantile-Quantile plots plot quantiles of a dataset vs. quantiles of a theoretical (usually normal) distribution

Z-Scores and the Standard Normal Distribution

Problem: Different datasets have different means and standard deviations.

Solution: Standardization allows comparisons by converting any normal distribution into a standard normal distribution (mean = 0, SD = 1).

\[ Z = \frac{X - \mu}{\sigma} \]

  • Z: How many standard deviations a value is from the mean.

  • X: Observed value.

  • μ: Population mean.

  • σ: Population standard deviation.

Why do Z scores matter?

  • The standard normal distribution allows us to calculate probabilities of observing extreme values.

  • Example: If 𝑍 = 2, the observation is 2 standard deviations above the mean.

Using a Z-table, we can find the probability of getting a result at least this extreme.

This concept extends to p-values, which tell us how rare our data is under the null hypothesis.

P-values

What is a P-value?

A p-value is the probability of obtaining results at least as extreme as the ones we observed, assuming the null hypothesis is true.

  • It helps quantify how surprising or unusual our data is under the null hypothesis.

  • A low p-value suggests that the observed data would be rare if the null hypothesis were true, which may lead us to question the null hypothesis

Example

Imagine flipping a coin 100 times, expecting about 50 heads if it’s fair (the null hypothesis). If you observe 90 heads, the p-value tells you how likely it is to get such an extreme result just by chance.

Hypothesis formation

  • Null hypothesis: Assumes there is no difference between groups or no relationship between variables. It represents the “status quo” or baseline expectation.

Example: A new drug has no effect compared to a placebo.

  • Alternative hypothesis: Assumes there is a difference or relationship.

Example: The new drug does improve patient outcomes compared to the placebo.

Null Hypothesis Probability Distribution

  • The shaded areas in the tails represent extreme outcomes (typically the most unexpected 5% if using an \(\alpha\) = 0.05).

  • If your observed data falls within these tails, it’s considered statistically significant, suggesting it’s unlikely to occur by random chance under the null hypothesis.

Impact of Sample Size on Distributions

  • Key Point: Larger sample sizes reduce variability, making it easier to detect small differences as statistically significant.

  • However, statistical significance doesn’t always mean practical importance—especially with large samples.

Hypothesis testing

  • Scenario: Testing if diet A and diet B affect mice longevity.

  • Null Hypothesis (H₀): No difference in longevity between diets.

  • Alternative Hypothesis (H₁): There is a difference in longevity.

lm(longevity ~ diet, data = mice)

Explanation: This linear model performs a t-test to assess if the mean lifespans differ significantly between diets.

Visualising the T Distribution

  • The t-distribution shows where your observed mean difference falls relative to what’s expected under the null hypothesis.

  • Confidence Intervals (CI): The red area marks the 95% CI. If zero falls outside this range, it suggests statistical significance.

# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     4.19     0.276     15.2  3.87e-13     3.62     4.77 
2 dietmice_b     -1.41     0.391     -3.60 1.60e- 3    -2.22    -0.595

Writing About Statistical Results

“Diets can change the longevity of mice (p = 0.0016).”

  • “Mice on diet B lived significantly shorter lives than mice on diet A (t22 = -3.6, p = 0.006).”

  • “Mice on diet B had a reduced mean lifespan of 1.41 years[95% CI; -0.595:-2.22] from the mice on diet A (mean 4.19 years (95% CI, 3.62-4.77). While statistically significant (t22 = -3.6, p = 0.006), this is a relatively small sample size, and further testing is recommended to confirm this effect.”

  • Which has the greatest level of useful detail?

Introduction to Standard Error

Standard error measures the accuracy of a sample mean as an estimate of the population mean.

\[ SE = {s\over \sqrt n} \]

Data sampling

Week Three: Inferential Statistics

Standard Error

Standard deviation vs. standard error:

  • Standard deviation measures variability within a sample

\(s = \sqrt{\sum(x - \overline x)^2\over n - 1}\)

  • Standard error is how much the sample mean (from a given sample) is likely to vary from the true population mean.

\(SE(\overline{X}) = \frac{s}{\sqrt(N)}\)

Standard Error

  • Standard Error of the Mean (\(\overline{X}\)) is the **Standard Deviation of the Sampling Distribution of the Mean.

  • If you took many samples of size 𝑛from the same population, each sample would yield a different sample mean.

Standard Error

  • All these sample means form their own distribution: the sampling distribution of the mean.

  • The Standard Error measures the width (standard deviation) of this distribution of sample means.

  • Larger 𝑛→ smaller SE → sample means cluster more tightly around the true mean.

Central Limit Theoreom (CLT)

  • For large 𝑛the distribution of sample means becomes approximately normal, no matter the original data’s shape (assuming finite variance & independence).

  • Lets us use normal-based methods (like confidence intervals and hypothesis tests) even if the original data are not normal—provided 𝑛 is large.

Shiny App demonstrating CLT

Small Samples

What is a sufficient 𝑛?

  • When the data is very close to a normal distribution it might be very small c.10

  • When the data is slightly skewed it might be 𝑛 = 30

  • When the data follows an alternative distribution it could be very large

Bootstrapping

  • Bootstrapping: Resample from your data many times to estimate the distribution (and SE) of the Mean (\(\overline{X}\)) without assuming normality.

# Perform bootstrapping
for(i in 1:1000) {
  # Resample with replacement
  resample <- sample(my_data, size = length(my_data), replace = TRUE)
  # Compute the statistic (mean here)
  boot_means[i] <- mean(resample)
}

Parametric fits

If you know the data follows a specific distribution (e.g. Poisson, Binomial, etc) we can base our inference on these - see Generalised Linear Models

Are two samples distributions from the same population?

  • Even if two samples are both drawn from the same underlying population, they won’t have identical means.

  • Question: When is that difference “big enough” to claim they come from different populations?

Standard Error vs. Standard Error of the Difference

Standard Error

Measures the spread (uncertainty) of one sample mean around its population mean.

Standard Error of the Difference (SED)

\(Mean~difference = \overline x_1 - \overline x_2\)

\(SED = \sqrt{s_1^2\over n_1}+{s_2^2\over n_2}\)

Standard Error of the difference is calculated from two \(s\) so will always be larger than a single population estimate

Two-sample t-tests

  • Null Hypothesis : The two samples do not differ in their population means

  • Alternative Hypothesis : The two samples come from populations with different means

Even with identical populations, sample means differ. The t-test checks if that difference is too large to be due to chance.

Different enough?

How different is “different enough”?

When can we determine that two sample means are different enough? To have come from different populations?

Two-sample t-test

  • Compare the means of two samples

  • Null hypothesis: there is no difference between the population means

  • t-statistic: based on difference in sample means, sd & n

  • For a given sample size and t-value, how unlikely is to have two random samples at least this different IF both samples are drawn from identical/same populations.

T-test in Action

Example: A botanist is studying the mass of pollen transported by individual bees from two different hives (Hive A and Hive B).

  • Hive A: 62 bees, (mean pollen mass = 31.4 mg, variance = 225.0 mg^2)

  • Hive B: 67 bees, (mean pollen mass = 36.7 mg, variance = 186.0 mg^2)

Is there a significant difference in pollen mass transported per bee for the two hives studied?

Variance \(s^2\) = Sum of squared residuals/ n

Standard deviation (s) = \(\sqrt Variance\)

Standard error of the difference

Calculate the difference in means, and the standard error of the difference in means

Mean Difference = 36.7 - 31.4

Mean Difference = 5.3

\[ SED = \sqrt{s^2_1\over n_1} + {s^2_2\over n_2} \]

\[ SED = \sqrt{225\over 62} + {186\over 67} \]

\[ SED = \sqrt{3.63} + {2.78} \]

\[ SED = 2.53 \]

Standard error of the difference

Mean difference = 5.3

Calculate 95% confidence intervals

Assuming a normal distribution this would be 1.96 * SED = 4.96

Assuming a t distribution = ( \(\alpha\) = 0.975, df = 129-2, SED = 2.53 ) = 5.008

95% CI range = 5.3 [± 5.008] = 0.292 - 10.308

\[ SED = 2.53 \]

T distributions

Confidence intervals

Calculate 95% confidence intervals

95% CI range = 5.3 [± 5.008] = 0.292 - 10.308

  • Describe the estimated mean difference

The mass of pollen transported per bee in Hive B is on average 5.3mg more than Hive A [0.29 - 10.3] (mean [± 95% CI])

The mass of pollen transported per bee in Hive B is on average 5.3mg more than Hive A [0.29 - 10.3] (mean [± 95% CI])

The 95% CI calculated from the t distribution doesn’t include 0 → likely a real difference between the two hive populations.

“At P ≤ 0.05 bees from Hive B transport at least 0.29mg of pollen more per bee than Hive A”

t-test in R

hive_data %>% 
  t_test(pollen_mass ~ hive,
         var.equal = T)

    Two Sample t-test

data:  hive_a and hive_b

t = 2.09, df = 127, p-value = 0.01152

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:
0.294166 10.30808

We can also produce t-tests (and many other tests besides with the lm() function in R)

Common misundertandingsabout P values

A p-value is NOT:

  • The probability that the null hypothesis is true.

  • The probability your results occurred “by chance.”

  • Proof of a meaningful or large effect.

What it IS:

  • A measure of how surprising your data is under the assumption the null hypothesis is true.

Common misconceptions about P-values

Evidence of P-hacking

A non-significant p-value DOES NOT mean the null hypothesis is true

In reality, we can’t know for sure if a true mean difference exists.

For illustration: Assume we could know the true mean difference.

The figure shows:

Grey line: Expected data if the null hypothesis is true.

Black line: Expected data if the alternative hypothesis is true.

A p-value shows how surprising the data are if the null is true.

A low p-value is evidence against the null, not proof of the alternative.

Why a significant p-value does not mean the null hypothesis is false

What we can conclude, based on our data, is that we have observed an extreme outcome, that should be considered surprising. But such an outcome is not impossible when the null-hypothesis is true.

Why a significant p-value is not the same as an important effect

If we plot the null model for a very large sample size, we can see that even very small mean differences will be considered ‘surprising’.

However, just because data is surprising, does not mean we need to care about it. It is mainly the verbal label ‘significant’ that causes confusion here – it is perhaps less confusing to think of a ‘significant’ effect as a ‘surprising’ effect.

Linear models

Linear models cover multiple data types

Difference tests: t-test, ANOVA, ANCOVA

Difference model

Association tests: Regressions

Regression model

Basic linear models

Basic linear models aim to describe a linear relationship between a response (outcome) variable and a predictor (input) variable, usually by the method of ordinary least squares

Basic linear models aim to describe a linear relationship between a:

  • response (dependent) variable and predictor(s) (independent) variable.

  • usually by the method of ordinary least squares.

Linear Regression Equation

\(Y = \beta0 + \beta_1x1 + \epsilon\)

Where:

\(y\) is the predicted value of the response variable

\(\beta0\) is the intercept (value of y when x = 0)

\(\beta_1x1\) is the slope of the regression line

\(\epsilon\) is the error term

regression

Example

Difference tests: t-test, ANOVA, ANCOVA


Call:
lm(formula = longevity ~ type, data = dros_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-31.74 -13.74   0.26  11.44  33.26 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       63.560      3.158  20.130  < 2e-16 ***
typeInseminated    0.520      3.867   0.134    0.893    
typeVirgin       -15.820      3.867  -4.091 7.75e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.79 on 122 degrees of freedom
Multiple R-squared:  0.2051,    Adjusted R-squared:  0.1921 
F-statistic: 15.74 on 2 and 122 DF,  p-value: 8.305e-07

Association tests: Regressions


Call:
lm(formula = longevity ~ thorax, data = dros_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.415  -9.961   1.132   9.265  36.812 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -61.05      13.00  -4.695  7.0e-06 ***
thorax        144.33      15.77   9.152  1.5e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.6 on 123 degrees of freedom
Multiple R-squared:  0.4051,    Adjusted R-squared:  0.4003 
F-statistic: 83.76 on 1 and 123 DF,  p-value: 1.497e-15

Combining variables and omitted variable bias

Now we have two (or more) reasonable predictors of longevity:


Call:
lm(formula = longevity ~ thorax + type + sleep, data = dros_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.153  -6.836  -2.191   7.196  29.046 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -56.04502   11.17882  -5.013 1.87e-06 ***
thorax          144.43008   13.11616  11.012  < 2e-16 ***
typeInseminated   3.62796    2.77122   1.309    0.193    
typeVirgin      -13.24603    2.76198  -4.796 4.70e-06 ***
sleep            -0.05281    0.06383  -0.827    0.410    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.23 on 120 degrees of freedom
Multiple R-squared:  0.6046,    Adjusted R-squared:  0.5914 
F-statistic: 45.88 on 4 and 120 DF,  p-value: < 2.2e-16

\(Y = \beta0 + \beta_1x1 +\beta_2x2 + \beta_3x3 + \epsilon\)

Line of best fit

Line fitting

Ordinary Least Squares

The line of best fit minimises the sum of the squared distance that each sample point is from the value predicted by the model

This is called the method of

Ordinary Least Squares

Residuals

The difference between the ACTUAL value of the observation \(y_i\) and the value that the model predicts \(\hat{y_i}\) at that \(x\) value are the residuals (residual error).

residuals

The regression model fitted by Ordinary Least Squares (OLS) will produce the equation for the line that MINIMIZES the sum of squares of the residuals

Sum of Squares Errors

sum of squares error

This produces the line with the smallest total area size for the squares.

\[ SSE = \underset{i=1}{n \atop{\sum}}(y_i - \hat{y_i})^2 \]

SUM OF RESIDUAL ERROR = \((2)+(-1)+(4)+(-3)+(-2)=0\)

SUM OF SQUARES OF RESIDUAL ERROR = \((2^2)+(-1^2)+(4^2)+(-3^2)+(-2^2)=34\)

Coefficient of determination

\(\LARGE{R^2}\) : The proportion of variation in the dependent variable that can be predicted from the independent variable

\[ \LARGE{R^2=1-{Sum~of~Squared~Distances~between~y_i~and~\hat{y_i}\over{Sum~of~Squared~Distances~between~y_i~and~\overline{y}}} } \]

\[ \LARGE{R^2=1-{SSE\over{SST}}} \]

Assumptions of the Linear Model

  • Errors should be normally distributed

  • Homoscedasticity of the errors

  • Independent observations

  • A Linear relationship

Model fit checks

performance::check_model(dros_model, 
                         detrend = F) 

Linearity

performance::check_model(dros_model,
                         check = "linearity") 

Normality

performance::check_model(dros_model,
                         check = "normality")
performance::check_normality(dros_model)

OK: residuals appear as normally distributed (p = 0.065).

Heteroscedasticity

performance::check_model(dros_model,
                         check = "homogeneity")
performance::check_heteroscedasticity(dros_model)

OK: Error variance appears to be homoscedastic (p = 0.155).

Outliers

performance::check_model(dros_model,
                         check = "outliers")
performance::check_outliers(dros_model)

OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)

Questions?

Interpreting Models

R model summary provides, the formula of the regression, the estimate of the intercept and standard error, estimated differences and uncertainity for each slope, the degrees of freedom for the whole model, F value and R squared

Reporting

# A tibble: 5 × 7
  term            estimate std.error statistic  p.value conf.low conf.high
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     -56.0      11.2       -5.01  1.87e- 6  -78.2    -33.9   
2 thorax          144.       13.1       11.0   6.45e-20  118.     170.    
3 typeInseminated   3.63      2.77       1.31  1.93e- 1   -1.86     9.11  
4 typeVirgin      -13.2       2.76      -4.80  4.70e- 6  -18.7     -7.78  
5 sleep            -0.0528    0.0638    -0.827 4.10e- 1   -0.179    0.0736
  • The effect of thorax is statistically significant and positive (beta = 144.43, 95% CI [118.46, 170.40], t(120) = 11.01, p < .001; Std. beta = 0.64, 95% CI [0.52, 0.75])

    • The effect of type [Inseminated] is statistically non-significant and positive (beta = 3.63, 95% CI [-1.86, 9.11], t(120) = 1.31, p = 0.193; Std. beta = 0.21, 95% CI [-0.11, 0.52])

    • The effect of type [Virgin] is statistically significant and negative (beta = -13.25, 95% CI [-18.71, -7.78], t(120) = -4.80, p < .001; Std. beta = -0.75, 95% CI [-1.07, -0.44])

  • The effect of sleep is statistically non-significant and negative (beta = -0.05, 95% CI [-0.18, 0.07], t(120) = -0.83, p = 0.410; Std. beta = -0.05, 95% CI [-0.16, 0.07])

ANOVA

Model summary

In this example of a simple linear model, we run the equivalent to a Student’s t-test.


Call:
lm(formula = longevity ~ thorax + type + sleep, data = dros_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.153  -6.836  -2.191   7.196  29.046 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -56.04502   11.17882  -5.013 1.87e-06 ***
thorax          144.43008   13.11616  11.012  < 2e-16 ***
typeInseminated   3.62796    2.77122   1.309    0.193    
typeVirgin      -13.24603    2.76198  -4.796 4.70e-06 ***
sleep            -0.05281    0.06383  -0.827    0.410    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.23 on 120 degrees of freedom
Multiple R-squared:  0.6046,    Adjusted R-squared:  0.5914 
F-statistic: 45.88 on 4 and 120 DF,  p-value: < 2.2e-16

Q. What happens when we have more than two groups in our predictor variable? Why shouldn’t we just do lots of t-tests?

ANalysis Of VAriance (ANOVA)

ANOVAs use information about variances, but the main goal of analysis is comparison of MEANS (don’t let the name mix you up - more on this later).

ANOVA is an omnibus test – it tests for significant differences between any means in the study

ANOVA is just a special case of the linear model

ANOVA Hypotheses:

Null Hypothesis (H0): All means are equal (in other words, all groups are from populations with the same mean)

Alternative Hypothesis (HA): At least two group means are NOT equal (that means that just two could be different, or they could ALL be different)

ANOVA Example

We have collected data for soil uranium concentrations at three locations on Los Alamos National Lab property: Site A, Site B, and Site C. The data structure is shown here:

A one-way ANOVA can be used to assess whether there is a statistically significant difference in uranium concentration in soil at three locations

A tidy dataframe illustrating one continuous dependent variable and and one factor predictor variable (with three levels)

One-way ANOVA (single factor)

What does this one-way/single-factor refer to?

There is a single factor (variable), with at least 3 levels, where we are trying to compare means across the different levels of that factor.

ANOVA does this indirectly by looking at total between and within group variances as a ratio (F).

Sums of Squares

OLS fits a line to produce the smallest amount of SSE

SST The squared differences between the observed dependent variable \(y_i\) and its overall mean \(\overline{y}\).

SSR The sum of the differences between the predicted value \(\hat{y_i}\) of the dependent variable and its overall mean \(\overline{y}\).

SSE The error is the difference between the observed dependent value \(y_i\) and the predicted value \(\hat{y_i}\).

\[ SSE = \underset{i=1}{n \atop{\sum}}(y_i - \hat{y_i})^2 \]

\[ SSR = \underset{i=1}{n \atop{\sum}}(\hat{y_i} - \overline{y})^2 \]

\[ SST = \underset{i=1}{n \atop{\sum}}(y_i - \overline{y})^2 \]

where:

\(y_i\) = Observed value

\(\hat{y_i}\) = Value estimated by model

\(\overline{y}\) = The Grand Mean

A tidy dataframe illustrating one continuous dependent variable and and one factor predictor variable (with three levels)

\(SSE + SSR = SST\)

What does an ANOVA actually do?

\[ \large F = {SSR / (k-1)\over SSE / (N-k)} = {MSR\over MSE} \]

\(k\) = Total number of groups

\(N\) = numerator degrees of freedom = Total number of observations across all groups

\(N-k\) = denominator degrees of freedom

\(MSR\) = Mean Squares Regression

\(MSE\) = Mean Squares Error

This is a ratio of the between group variance and the the within group variance.

F distribution

The F-value or ratio of variances, over their respective degrees of freedom will have an F-distribution.

This F distribution is used when we want to compare within and between group variances.

The curve of the distribution depends on the degrees of freedom, and it is always positively skewed

Significance testing

  • The higher the F-value the greater the signal-to-noise ratio.

  • For a given value of numerator and denominator degrees of freedom we can look up the probability of observing this ratio under a null hypothesis of identical variances.

  • If F value is high enough then we might have enough evidence to conclude that samples are likely drawn from populations with different means.

Ask a question about: ANOVA

ANOVA Example

dros_model2 <-lm(longevity ~ thorax + sleep, data = dros_clean)

anova(dros_model2, dros_model)
Analysis of Variance Table

Model 1: longevity ~ thorax + sleep
Model 2: longevity ~ thorax + type + sleep
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    122 22701                                  
2    120 15125  2    7576.9 30.058 2.617e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(dros_model, test = "F")
Single term deletions

Model:
longevity ~ thorax + type + sleep
       Df Sum of Sq   RSS    AIC  F value    Pr(>F)    
<none>              15125 609.47                       
thorax  1   15282.8 30407 694.77 121.2556 < 2.2e-16 ***
type    2    7576.9 22701 656.23  30.0578 2.617e-11 ***
sleep   1      86.3 15211 608.18   0.6846    0.4097    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-hoc vs. Planned contrasts

Correcting for multiple comparisons

  • The F-ratio tells us only whether the model fitted to the data (SSR) accounts for more variance than other factors (SSE).

  • So if the F-ratio is large enough to be statistically significant, then we only know that one or more differences between means are statistically significant.

  • Further testing is needed!

Planned contrasts

  • You focused on a few scientifically sensible comparisons, rather than every possible comparison

  • You chose these as part of the experimental design before collecting data

  • Could structure linear model to reflect this?

Post hoc

  • Typically unplanned comparisons, conducted only after a significant ANOVA result

  • All combinations checked

  • Needs correcting for inflated Type 1 error

Post hoc

Method Equation Notes
Bonferroni/Dunn-Sidak \(p = {\alpha \over k}\) Correct critical p for the number of independent tests
Holm’s \(p_{1} < \alpha/(k–1+1) = \alpha/k\) we start with the smallest p-value (i = 1) and determine whether there is a significant result (i.e. p1 < α/(k–1+1) = α/k. If so we move on to the second test. We continue in this manner until we get a non-significant result
Tukey HSD \(t_{crit} = q * \sqrt{MSE \over N}\) Essential a t-test, correcting for multiple comparison, q-values can be looked up for test df and number of treatments

Estimated means

means <- emmeans::emmeans(dros_model,
                          specs = ~ type)

means
 type        emmean   SE  df lower.CL upper.CL
 Control       61.3 2.26 120     56.8     65.8
 Inseminated   64.9 1.59 120     61.8     68.1
 Virgin        48.0 1.59 120     44.9     51.2

Confidence level used: 0.95 

Estimated mean differences

means <- emmeans::emmeans(dros_model, 
                 specs = pairwise ~ type)

means
$emmeans
 type        emmean   SE  df lower.CL upper.CL
 Control       61.3 2.26 120     56.8     65.8
 Inseminated   64.9 1.59 120     61.8     68.1
 Virgin        48.0 1.59 120     44.9     51.2

Confidence level used: 0.95 

$contrasts
 contrast              estimate   SE  df t.ratio p.value
 Control - Inseminated    -3.63 2.77 120  -1.309  0.3929
 Control - Virgin         13.25 2.76 120   4.796  <.0001
 Inseminated - Virgin     16.87 2.25 120   7.508  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates 

Estimated CI of the difference

means <- emmeans::emmeans(dros_model, 
                 specs = pairwise ~ type)

confint(means)
$emmeans
 type        emmean   SE  df lower.CL upper.CL
 Control       61.3 2.26 120     56.8     65.8
 Inseminated   64.9 1.59 120     61.8     68.1
 Virgin        48.0 1.59 120     44.9     51.2

Confidence level used: 0.95 

$contrasts
 contrast              estimate   SE  df lower.CL upper.CL
 Control - Inseminated    -3.63 2.77 120   -10.20     2.95
 Control - Virgin         13.25 2.76 120     6.69    19.80
 Inseminated - Virgin     16.87 2.25 120    11.54    22.21

Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates 

Predictions

dros_model <- lm(longevity ~ thorax + type + sleep, data = dros_clean)
emmeans::emmeans(dros_model,
                 specs = ~ thorax + type,
                 at =list(thorax = seq(.64, .94, by = .02))) %>% 
  as_tibble() 

Turning a hypothesis into a model

Turning a question into a model

  • Hypotheses in research predict relationships between variables.

  • Linear models help test these hypotheses by quantifying these relationships.

  • Converting hypotheses into linear models involves identifying dependent and independent variables.

  • Considering and including important covariates

  • Determining whether interaction terms need to be included

Simple linear model

  • Question: Does the amount of sunlight affect plant growth?

  • Hypothesis: More sunlight hours increase plant growth

  • Linear Model: height ~ sunlight_hours

Control variables

  • Question: Does the amount of sunlight affect plant growth?

  • Hypothesis: More sunlight hours increase plant growth

  • Control variables: Soil quality

  • Linear Model: height ~ sunlight_hours + soil quality

Multiple predictor model

  • Question: How do water and nutrient levels affect plant growth?

  • Hypothesis: Both increased water and nutrient levels lead to increased plant growth.

  • Control Variables: Temperature

  • Linear Model: growth ~ water + nutrient + temp_c

Multiple predictor model

  • Question: Do water and nutrient levels work together to affect plant growth?

  • Hypothesis: Increasing water and nutrient levels has a combined effect that leads to increased plant growth, greater than the effect of either on their own.

  • Control Variables: Temperature

  • Linear Model: growth ~ water + nutrient + temp_c + water:nutrient

Exercise 1

  • Question: Does temperature affect the metabolic rate of lizards?

  • Hypothesis: Temperature increases the metabolic rate of lizards

  • Control Variables: Age of lizards

  • Linear Model: metabolic_rate ~ temp + age

Exercise 2:

  • Question: How does diet and exercise affect cholesterol in mice?

  • Hypothesis: low fat diets and regular exercise decrease cholesterol levels in mice

  • Control Variables: Genetic strain of mice

  • Linear Model: cholesterol ~ diet_fat + daily_exercise_hours + strain

  • Linear Model: cholesterol ~ diet_fat + daily_exercise_hours + strain + diet_fat:daily_exercise

Exercise 3:

  • Question: Do immune system responses depend on stress and sleep in humans?

  • Hypothesis: Negative effects of stress on immune responses are lessened if sleep duration remains high

  • Control Variables: Age

  • Linear Model: immune_response ~ sleep_hours + stress + age + sleep_hours:stress

Questions?

Resources

Additional resources

  • Discovering Statistics - Andy Field

  • Happy Git

  • An Introduction to Generalized Linear Models - Dobson & Barnett

  • An Introduction to Statistical Learning with Applications in R - James, Witten, Hastie & Tibshirani

  • Mixed Effects Models and Extensions in Ecology with R - Zuur, et al.

  • Ecological Statistics with contemporary theory and application

  • The Big Book of R (https://www.bigbookofr.com/)

  • British Ecological Society Guides to Better Science

*(SORTEE)

Reading list

Reading list